[format] Support writing and reading Arrow schema metadata for file formats#8321

Open
lxy-9602 wants to merge 3 commits into apache:master from lxy-9602:add-format-meta

@lxy-9602

Contributor

Purpose

This PR is a sub-PR for shared-shredding.

Shared-shredding needs to attach dictionary and other field-level metadata before closing data files. To support that flow, this PR adds a generic metadata write/read path for file formats, so upper layers can provide raw key-value metadata during writing, and readers can parse the stored metadata back when opening files.

The metadata representation follows the schema metadata convention already used by Arrow's Parquet integration: the serialized Arrow schema is stored under the ARROW:schema file metadata key, and on read the value is base64-decoded and then deserialized with Arrow IPC schema reading. See Apache Arrow's Parquet schema implementation, where kArrowSchemaKey is ARROW:schema and the value is base64-decoded before ReadSchema.

Related design:
https://cwiki.apache.org/confluence/display/PAIMON/PIP-43%3A+Columnar+Storage+Optimization+for+MAP+Type+in+Paimon

Brief change log

  • Add SupportsWriterMetadata so format writers can accept raw Map<String, byte[]> metadata before file close.
  • Add SupportsReaderArrowSchema so format readers can return the stored Arrow schema metadata.
  • Add FormatMetadataUtils for:
    • base64 encoding metadata values before storing them in format footers;
    • base64 decoding stored metadata values;
    • reading the fixed ARROW:schema key into an Arrow Schema;
    • extracting field-level metadata as Map<String, Map<String, String>>.
  • Support metadata writing for Parquet and ORC writers.
  • Support Arrow schema metadata reading from Parquet and ORC readers.
  • Add Parquet/ORC tests covering metadata write/read and field-level Arrow metadata roundtrip.
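The base64 step in the change log can be illustrated with a minimal sketch. This is not the PR's FormatMetadataUtils: the class name FormatMetadataSketch and the method names encode and decode are assumptions made for illustration, and the sketch uses only java.util.Base64 to show how raw byte[] values could become footer-safe strings and be recovered on read.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the PR's FormatMetadataUtils; names and signatures are assumptions.
final class FormatMetadataSketch {

    // Base64-encode each raw value so it can be stored as a string in a format footer.
    static Map<String, String> encode(Map<String, byte[]> raw) {
        Map<String, String> encoded = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : raw.entrySet()) {
            encoded.put(e.getKey(), Base64.getEncoder().encodeToString(e.getValue()));
        }
        return encoded;
    }

    // Base64-decode stored footer strings back into the original raw bytes.
    static Map<String, byte[]> decode(Map<String, String> stored) {
        Map<String, byte[]> decoded = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : stored.entrySet()) {
            decoded.put(e.getKey(), Base64.getDecoder().decode(e.getValue()));
        }
        return decoded;
    }

    public static void main(String[] args) {
        Map<String, byte[]> raw = new LinkedHashMap<>();
        // Arbitrary binary payload, including a byte that is not valid text on its own.
        raw.put("dict", new byte[] {0, 1, 2, (byte) 0xFF});

        Map<String, String> footer = encode(raw);
        Map<String, byte[]> back = decode(footer);

        System.out.println("stored value: " + footer.get("dict"));
        System.out.println("roundtrip ok: " + Arrays.equals(raw.get("dict"), back.get("dict")));
    }
}
```

Encoding to a string is what lets arbitrary binary values, such as a serialized schema or dictionary, sit in footer key-value metadata that is string-typed.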

Compatibility

The metadata value is stored using Arrow-compatible base64 encoding. For Arrow schema metadata, the key is ARROW:schema, matching Arrow Parquet's convention.

Tests

mvn -pl paimon-format -am -Pfast-build -DfailIfNoTests=false \
  -Dtest=ParquetFormatReadWriteTest#testWriteMetadata,OrcFormatReadWriteTest#testWriteMetadata,FormatMetadataUtilsTest test

default ParquetWriter<T> createWriter(
        OutputFile out, String compression, Supplier<Map<String, byte[]>> metadataSupplier)
        throws IOException {
    return createWriter(out, compression);

Contributor

This default makes metadata support silently disappear for any ParquetBuilder implementation that only implements the original two-argument createWriter. ParquetWriterFactory still returns a ParquetBulkWriter that implements SupportsWriterMetadata, so callers can call addMetadata(...) successfully, but finalizeWrite() will never see the map and the footer will not contain the entries. Please either make support explicit (for example, fail from this overload unless the builder wires the supplier into WriteSupport.finalizeWrite) or avoid exposing SupportsWriterMetadata for builders that cannot persist it.

@lxy-9602 lxy-9602 Jun 23, 2026

Contributor Author

Thanks for pointing this out. I agree that silently falling back to the two-argument createWriter would make the metadata support look available while the footer entries are actually dropped, which is quite misleading.

I updated the default metadata-aware createWriter overload to fail explicitly with UnsupportedOperationException unless a ParquetBuilder implementation wires the metadata supplier into the writer path. I also added a regression test to cover a builder that only implements the original two-argument createWriter, making sure it fails explicitly instead of silently ignoring metadata.
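The fail-fast behavior described above might look roughly like the following sketch. It is not the PR's code: BuilderSketch, WriterSketch, LegacyBuilderSketch, and MetadataAwareBuilderSketch are hypothetical stand-ins, the real ParquetBuilder works with OutputFile and ParquetWriter<T>, and the checked IOException is dropped to keep the example short.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical writer stand-in: it remembers the supplier and consults it only at close time.
final class WriterSketch {
    private final Supplier<Map<String, byte[]>> metadataSupplier;

    WriterSketch(Supplier<Map<String, byte[]>> metadataSupplier) {
        this.metadataSupplier = metadataSupplier;
    }

    // Entries that would be written to the footer when the file is closed.
    Map<String, byte[]> footerEntriesOnClose() {
        return metadataSupplier.get();
    }
}

// Hypothetical simplification of a builder interface with a metadata-aware overload.
interface BuilderSketch {
    WriterSketch createWriter(String out, String compression);

    // Fail explicitly instead of delegating to the two-argument overload,
    // which would silently drop the metadata.
    default WriterSketch createWriter(
            String out, String compression, Supplier<Map<String, byte[]>> metadataSupplier) {
        throw new UnsupportedOperationException(
                "This builder does not wire writer metadata into the footer");
    }
}

// A builder that only knows the original two-argument method.
final class LegacyBuilderSketch implements BuilderSketch {
    @Override
    public WriterSketch createWriter(String out, String compression) {
        return new WriterSketch(Collections::emptyMap);
    }
}

// A builder that passes the supplier through to the writer path.
final class MetadataAwareBuilderSketch implements BuilderSketch {
    @Override
    public WriterSketch createWriter(String out, String compression) {
        return new WriterSketch(Collections::emptyMap);
    }

    @Override
    public WriterSketch createWriter(
            String out, String compression, Supplier<Map<String, byte[]>> metadataSupplier) {
        return new WriterSketch(metadataSupplier);
    }
}
```

With this shape, a caller that requests metadata support from a builder that cannot persist it gets an exception at writer creation, rather than a file whose footer is missing the entries.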
